An Aggregated Mutual Information Based Feature Selection with Machine Learning Methods for Enhancing IoT Botnet Attack Detection

Due to the wide availability and usage of connected devices in Internet of Things (IoT) networks, the number of attacks on these networks is continually increasing. A particularly serious and dangerous type of attack in the IoT environment is the botnet attack, where the attackers can control the IoT systems to generate enormous networks of “bot” devices for generating malicious activities. To detect this type of attack, several Intrusion Detection Systems (IDSs) have been proposed for IoT networks based on machine learning and deep learning methods. As the main characteristics of IoT systems include their limited battery power and processor capacity, maximizing the efficiency of intrusion detection systems for IoT networks is still a research challenge. It is important to provide efficient and effective methods that use lower computational time and have high detection rates. This paper proposes an aggregated mutual information-based feature selection approach with machine learning methods to enhance detection of IoT botnet attacks. In this study, the N-BaIoT benchmark dataset was used to detect botnet attack types using real traffic data gathered from nine commercial IoT devices. The dataset includes binary and multi-class classifications. The feature selection method incorporates Mutual Information (MI) technique, Principal Component Analysis (PCA) and ANOVA f-test at finely-granulated detection level to select the relevant features for improving the performance of IoT Botnet classifiers. In the classification step, several ensemble and individual classifiers were used, including Random Forest (RF), XGBoost (XGB), Gaussian Naïve Bayes (GNB), k-Nearest Neighbor (k-NN), Logistic Regression (LR) and Support Vector Machine (SVM). The experimental results showed the efficiency and effectiveness of the proposed approach, which outperformed other techniques using various evaluation metrics.


Introduction
Internet of Things (IoT) networks are becoming essential components for different advanced applications such as smart cities and smart homes. They provide wide connectivity between the connected devices, with the number of networks growing exponentially every day [1]. The IoT improves the quality of life by providing different types of smart services and applications in several domains, including health care, automation, industrial processes and smart environments [2]. According to Greengard [3], it is predicted that 21.5 billion IoT devices will be used by 2025. This huge number of devices will be vulnerable to different types of attacks that raise several security and privacy issues.
With this rapid development in the internet and its smart connected devices, the number of attacks that affect individuals and businesses has already increased [4]. One of the main applications to improve information security is the use of what are called Intrusion Detection Systems (IDSs), which help to provide a secure environment by identifying and classifying security threats within the internet. Because of the special characteristics of IoT systems, including the dynamics of their networks, and limited battery power and processor capacity, intrusion detection for IoT networks is considered a major challenge, as it needs to consider the trade-off between accuracy of detection and performance overheads [5]. Thus, according to Arshad et al. [5], the main features of IDSs should be: (1) efficient computational and communication overhead, and (2) high detection accuracy.
One of the dangerous threats in IoT networks is the botnet, which can be described as a collection of bots that are controlled by the Botmaster (the behind-the-scenes attacker) through a Command and Control (C&C) channel [6]. An IoT botnet attack recruits vulnerable IoT devices to build enormous networks of "bot" devices that generate large numbers of malicious activities and can be controlled remotely by the Botmaster [7]. The attackers can use botnets for stealing data, granting access to devices and performing Distributed Denial-of-Service (DDoS) attacks. A DDoS attack uses a series of connected devices to take down a website or network, with the purpose of disrupting operations in these environments or stopping the main services of the target application [7]. Therefore, detecting and preventing botnets is very important in computer security, and several researchers have worked on improving the IoT botnet attack detection rate.
Recently, different methods have been proposed and applied to detect IoT botnet attacks. For instance, Popoola et al. [8] proposed a deep learning-based botnet attack detection method to deal with imbalanced traffic data in networks. They utilized a recurrent neural network method for learning hierarchical feature representations of the balanced data to carry out the classification. The authors found that this imbalanced data affected the detection performance, using evaluation measures such as precision, recall and F1 score. The proposed method obtained 99.50%, 99.75% and 99.62% for precision, recall and F1 scores, respectively. In addition, Soe et al. [9] proposed a botnet attack detection method based on Machine Learning (ML) and Sequential Architecture. In this work, the authors adopted a Feature Selection (FS) method to produce a high-performance and lightweight detection system. This system obtained an accuracy of 99% for detecting the botnet attacks using an artificial neural network, J48 decision tree and naïve Bayes. To compare the many machine learning methods that have been applied for botnet attack detection, Tuan et al. [10] conducted experiments for performance evaluation of several machine learning methods for botnet DDoS attack detection using two datasets. The experiments included the use of Support Vector Machine (SVM), Artificial Neural Network (ANN), Naïve Bayes (NB), Decision Tree (DT) and Unsupervised Learning (UL). The outcomes of this research showed that the unsupervised learning methods obtained better detection rates compared to the other machine learning methods.
The main requirements of IDSs for IoT networks are efficient computational and communication overhead and high detection accuracy [5], and the high dimensionality of IoT traffic data reduces the efficiency of detection systems. This paper proposes an aggregated mutual information-based feature selection approach with machine learning methods to enhance the efficiency and performance of IoT botnet attack detection. A freely available benchmark dataset was used to show the benefit of the proposed aggregated feature selection method. Based on an intensive review of the existing available datasets, the N-BaIoT dataset (http://archive.ics.uci.edu/ml/datasets/detection_of_IoT_botnet_attacks_N_BaIoT (last accessed on: 6 December 2021; 23:00 GMT)) was chosen for use in this research.
The main contributions of this research paper can be summarized as follows:
• IoT botnet attack detection is explored as a multi-class classification problem using a dataset with more than 6.2 M instances. The description of the dataset is presented in Section 3.1.
• A feature selection-based method is proposed that incorporates the Mutual Information (MI) technique, Principal Component Analysis (PCA) and an ANOVA f-test at a finely granulated detection level.
• A fine-granulated aggregated mutual information is proposed and tested on the benchmark dataset. The proposed technique effectively selects the relevant features for increasing the performance of IoT botnet classifiers.
• A comprehensive and practical approach is proposed that investigates the performance of the proposed technique using two ensemble-based machine learning methods, namely Random Forest (RF) and XGBoost (XGB), and four standalone classifiers, namely Gaussian Naïve Bayes (GNB), k-Nearest Neighbor (k-NN), Logistic Regression (LR) and Support Vector Machine (SVM).
• Finally, the proposed approach outperforms other techniques using various evaluation metrics.
The rest of the paper is organized as follows: Section 2 reviews the recent studies on IoT botnet attack detection. Section 3 presents the materials and methods used in the present study, while Section 4 highlights and discusses the main results of the proposed approach. Finally, Section 5 concludes the whole paper.

Related Works
Although the increased usage and growth of information and computer technology makes life easier, it also leads to many security issues as the number of attackers has increased rapidly. One of the important security mechanisms proposed to support information security and protect businesses from dangerous network attacks is known as the intrusion detection system [11]. Several intrusion detection systems based on machine learning and deep learning methods have been proposed for IoT Environments. For instance, Kiran et al. [12] applied NB, SVM, DT and Adaboost methods to detect the attacks (sniffing and poisoning) on IoT networks. They used IoT-based normal and attack data in order to build the model. The applied methods obtained high accuracy rates (0.9895, 0.9895 and 1.00 for SVM, Adaboost and DT respectively). However, these authors indicate that challenges still exist in generating high quality datasets using diverse IoT devices in order to enhance the robustness of the used machine learning models.
Pacheco et al. [13] proposed an artificial neural network-based method for implementing an adaptive IDS to detect attacks on fog nodes in IoT applications and ensure the availability of communication, allowing the nodes to continuously deliver the important information to the end users. The proposed method was able to detect the normal behavior of fog nodes and was able to detect anomalies due to different sources, such as misuses, cyber-attacks, with a high detection rate and low false alarms. In addition, Ferrag et al. [14] proposed an IDS for IoT networks called RDTIDS, which combines REP Tree, JRip algorithm and Random Forest methods. The proposed system used a BoT-IoT dataset and obtained high accuracy in the detection rate compared to the previous studies.
In another study, Amouri et al. [15] proposed an IDS for mobile IoT networks, which involved two stages: (1) Collecting data from dedicated sniffers and generating correctly classified instances that are sent to super node, (2) linear regression performed by the super node to detect the benign and malicious nodes. The proposed system was able to detect the malicious activities (blackhole and DDoS) attacks with detection rates of more than 98% for the high power/node velocity case and 90% for the low power/node velocity case. Similarly, Verma and Ranga [16] used different machine learning methods to detect Denial-of-Service (DoS) attacks on IoT networks. They used different popular datasets and applied statistical methods to evaluate the significant differences between the methods used. They discussed how to select the best classification method based on the application requirements and recommended using ensemble methods to develop IDSs. In addition, Hindy et al. [17] investigated six machine learning methods for an IoT intrusion detection system to detect one type of IoT attack, known as a Message Queuing Telemetry Transport (MQTT) attack. The results showed the effectiveness of the machine learning methods used and emphasized the importance of using flow-based features to detect MQTT-based attacks.
Lv et al. [18] proposed a misuse IDS that depends on specific attack signatures to detect normal and malicious activities, based on an extreme learning machine with a hybrid kernel function. They used the Kernel Principal Component Analysis (KPCA) method for feature selection and feature extraction of the intrusion detection data. The experimental results showed high detection rates and time-saving when using the proposed method. For IoT networks, Gad, Nashat and Barkat [19] used a chi-square feature selection method with different machine learning methods (using binary and multi-class data) on a dataset from a large-scale and diverse IoT network. The experiment showed that the XGBoost classifier outperformed other methods.
Feature selection methods were also used to enhance the detection of IoT botnet attacks. For instance, Alqahtani, Mathkour and Ben Ismail [20] concluded that it is still a challenge to develop an efficient IDS for IoT devices. To address this, they proposed a feature selection method (using a Fisher-score) with a genetic-based XGBoost classifier to obtain a subset of features for detecting IoT botnet attacks. They conducted experiments on a public botnet dataset and it was found that high detection rates were obtained by using only three features. Similarly, Bahşi, Nõmm and La Torre [21] investigated the importance of improved feature selection for reducing the number of features to detect the IoT bots. They showed that a small number of features can obtain high detection rates using a multi-class classifier such as a decision tree. In addition, Panda, Abd Allah and Hassanien [22] developed an efficient feature engineering model with machine learning and deep learning methods for detecting IoT-botnet attacks. To provide efficient detection, two feature engineering methods, K-Medoid sampling and scatter search-based, were applied to obtain optimal feature subsets for the representative dataset. The experimental results showed that the proposed method combined a high detection rate with low computational cost (4.7 s for training and 0.61 s for testing).
Feature selection methods have been used in different research disciplines to enhance the proposed machine learning models, for instance IDS for vehicular ad hoc networks [23], drone intrusion detection [24], clickbait detection on social media [25], detection of diseases in health informatics [26] and virtual screening for molecular similarity searching [27]. In addition to machine learning methods for IDS in IoT, several deep learning methods have been applied for intrusion detection systems in IoT, which are discussed in [28]. Although several studies in the literature address IoT intrusion detection, more research effort is needed to consider the special characteristics and challenges of IoT systems, which include limited battery power and processor capacity. According to [5], it is necessary to consider the trade-off between accuracy of detection and performance overheads in order to provide efficient computational and communication overhead together with high detection accuracy. Therefore, this paper proposes a feature selection-based method with several machine learning methods to enhance the performance of IoT botnet classifiers. The feature selection methods include Mutual Information (MI), Principal Component Analysis (PCA) and the ANOVA f-test at a fine-granulated detection level.

Materials and Methods
In this section, the N-BaIoT benchmark dataset is presented and discussed briefly. The data preprocessing and label encoding processes are then explained. Then, the well-known One-versus-the-Rest (OvR) classification technique was used for dealing with multiclass classification problems. Finally, this section describes the methodology used, including details of the choice of classifiers, feature selection methods and the evaluation criteria.
The methodology followed in this research is presented in Figure 1. It includes data collection, data preparation, feature selection and classifier selection, where the classifiers are trained and tested on the benchmark dataset with hyper-parameter tuning of the ML models. To establish a baseline, the classifiers were first trained and tested without applying any feature selection method. This step helped to measure the efficiency of the feature selection techniques used and to investigate their influence on the performance of the ML models. In addition, two data preprocessing techniques were applied: standardization and minimum-maximum normalization (known as min-max normalization). Each attack type was then fed into the feature selection methods to obtain a set of reduced features. Subsequently, the reduced feature set was used for training the ML classifiers using the OvR strategy. The hyper-parameters of the winning ML classifier were then tuned using k-fold cross-validation. In the last phase, the performance of the ML classifiers was reported.
Sensors 2022, 22, x FOR PEER REVIEW

Used Dataset
The N-BaIoT dataset used in this paper is designed to detect botnet attack types, using real traffic data provided by nine IoT devices [29]. The IoT devices were attacked by two botnet attack families, namely Bashlite and Mirai. In total, there are about five million items of data, grouped in separate files. Each file contains 115 features and a class label. The dataset has also been constructed to serve binary classification as well as multi-class classification, where the target class labels take values of "benign" or "TCP attack" for binary classification and "Bashlite" or "Mirai" attack types for multi-class classification.
Table 1 below and Table A1 (see Appendix A) show the detailed statistics of the N-BaIoT dataset and the complete list of extracted features. The data records are encoded as L0.01, L0.1, L1, L3 and L5 with respect to the network stream time windows. In addition, the socket and channel categories are enriched with additional information about the packet size. For each category, the packet count, mean, packet size and variance are calculated.
From Table 1, it is clear that the dataset is organized in a way that allows both binary classification and multi-class classification to be addressed. In this study, as mentioned earlier, the multi-class classification is investigated, where the number of instances for the benign class and the different attack subclass types is presented in Table 2. As the distribution of data records is clearly not balanced, the pseudocode presented in Algorithm 1 was used to sample the instances of the "Bashlite" and "Mirai" attack types.

Algorithm 1 Pseudocode of Dataset Sampling
Input: A list of N-BaIoT files F
Output: Balanced dataset
DF ← an empty list
s ← size of data frame
for each file f ∈ F do:
    Import the file f as data frame df
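The sampling idea can be illustrated with a minimal Python sketch. Only the opening steps of Algorithm 1 are available here, so the per-file rule used below (draw at most s rows from each imported data frame and concatenate the samples) is an assumption rather than the authors' exact procedure. The function name `sample_frames`, the toy column names and the label strings are hypothetical.

```python
import pandas as pd

def sample_frames(frames, s, seed=0):
    """Draw at most `s` rows from each data frame and stack the samples.

    Assumption: a fixed per-file cap is used to balance the classes.
    """
    sampled = []  # DF <- an empty list
    for df in frames:  # one data frame per imported N-BaIoT file
        n = min(s, len(df))  # files smaller than s are kept whole
        sampled.append(df.sample(n=n, random_state=seed))
    return pd.concat(sampled, ignore_index=True)

# Toy stand-ins for two imported files (hypothetical values and labels).
benign = pd.DataFrame({"feature": range(5), "label": "benign"})
mirai = pd.DataFrame({"feature": range(100), "label": "mirai_udp"})
balanced = sample_frames([benign, mirai], s=10)
```

In practice each frame would come from reading one dataset file, for example with `pd.read_csv`, before being passed to the function.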

Data Preprocessing
Although data preprocessing is tedious and time consuming [30,31], it is necessary not only for simplifying the machine learning training process but also for improving the effectiveness of the overall process. Consequently, this study applies the following preprocessing steps: label encoding, min-max normalization and standardization.

Label Encoding
As the class label contains 11 different categorical values (one "Benign" class and 10 attack type subclasses), these values cannot be fed directly to the ML classifiers. Therefore, the labels are encoded into numerical values before the models are used. In the literature, there are several approaches for encoding categorical values: one-hot encoding [32], ordinal encoding [33], similarity encoding [34], entity embedding [35] and multi-hot encoding [36]. Among these, the most used approaches are one-hot and ordinal encoding [37]. For encoding the categorical values found in the class label, this study applies the one-hot encoding approach and transforms each categorical value into a vector of binary variables. It should be noted that applying one-hot encoding increases the dimensionality by up to 10 more dimensions.

Normalization and Standardization
The performance of regression and classification models is seriously affected if the dataset columns contain values with different ranges. Mahfouz et al. [37] discussed how the performance of ML models deteriorates when features with very different scales occur in the dataset. Therefore, to deal with such problems, the negligible and dominant values must be brought into an acceptable common range. The two most popular techniques are min-max normalization and z-score standardization:
• Min-max normalization is used for transforming values of the dataset features into the range of [0, 1] according to the following equation:
$$X_{normalized} = \frac{X - X_{min\_value}}{X_{max\_value} - X_{min\_value}}$$
where $X_{normalized}$ represents the normalized value, $X_{min\_value}$ and $X_{max\_value}$ are the minimum and maximum values of the feature, which are mapped to the borders of the desired interval (in this study, [0, 1]), and $X$ is the original value to be transformed.
• Z-score standardization is used for rescaling dataset features so that they have the properties of a standard normal distribution with mean µ = 0 and standard deviation σ = 1.
Algorithm 2 shows the pseudocode of the one-hot encoding approach, minimum-maximum normalization and standardization techniques used in this study.
Algorithm 2 Pseudocode of One-hot encoding, Min-Max Normalization and Z-score Standardization
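The three preprocessing steps can be sketched in Python as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the statistics are computed column-wise, and constant columns are left unscaled to avoid division by zero.

```python
import numpy as np

def one_hot(labels):
    """Map each categorical label to a binary indicator vector."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    encoded = np.zeros((len(labels), len(classes)), dtype=int)
    for row, label in enumerate(labels):
        encoded[row, index[label]] = 1
    return classes, encoded

def min_max(X):
    """Rescale every column to the interval [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (X - lo) / span

def z_score(X):
    """Rescale every column to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma = np.where(sigma > 0, sigma, 1.0)  # guard against constant columns
    return (X - mu) / sigma

# Toy example with hypothetical labels and two features on different scales.
classes, Y = one_hot(["benign", "mirai", "bashlite", "mirai"])
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_norm = min_max(X)
X_std = z_score(X)
```

A common design choice, assumed here rather than stated in the text, is to compute the scaling statistics on the training split only and reuse them on the test split.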

Feature Selection Techniques
As mentioned earlier, the N-BaIoT dataset consists of 115 features and 10 class labels, plus the "Benign" class that was added after encoding the target class. Passing this high-dimensional vector into the ML models might increase their training and testing time. Consequently, an attack detection system built on the full feature set consumes processing resources very rapidly, which is not appropriate for real-time systems. Therefore, the proposed approach first investigates how various filter-based feature selection techniques can help to overcome this issue. The impact of PCA, MI and the ANOVA f-test on the performance of ML models is explored. As presented in Section 4.1, the experimental results show that the MI filter-based technique yields the highest accuracy score when the binary dataset is used. An aggregated MI with different rank aggregation functions is then proposed and tested on the multi-class dataset (see Section 4.2). The idea behind the aggregated MI is as follows: compute the mutual information score for each feature, f_i, in dataset D with respect to each class type c ∈ C. The features are then ranked based on the aggregator functions listed in Table 3. Only p% of the features are retained and fed later to the classifiers listed in Table 3, and the overall performance is measured.

Aggregators | Formula | Description
Min | Min( ) | Selects the minimum of the relevance scores produced when class type c_i is used as a target class
Max | Max( ) | Selects the maximum of the relevance scores produced when class type c_i is used as a target class
Mean | Mean( ) | Selects the mean of the relevance scores produced when class type c_i is used as a target class
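The aggregated MI idea can be sketched in Python with scikit-learn's `mutual_info_classif`. This is a minimal illustration under stated assumptions and not the authors' code: the function name `aggregated_mi`, the one-vs-rest binarization of the target, the rounding rule for p% and the toy data are all hypothetical choices.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

AGGREGATORS = {"min": np.min, "max": np.max, "mean": np.mean}

def aggregated_mi(X, y, aggregator="max", p=10, seed=0):
    """Return indices of the top p% features ranked by aggregated per-class MI."""
    classes = np.unique(y)
    # One row of MI scores per class type: that class is the positive
    # target and all remaining classes are treated as negative.
    scores = np.vstack([
        mutual_info_classif(X, (y == c).astype(int), random_state=seed)
        for c in classes
    ])
    combined = AGGREGATORS[aggregator](scores, axis=0)  # one score per feature
    k = max(1, int(round(X.shape[1] * p / 100)))  # assumed rounding rule
    return np.argsort(combined)[::-1][:k]  # highest aggregated score first

# Toy data: 3 classes, 20 features, only feature 0 carries class information.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=300)
X = rng.normal(size=(300, 20))
X[:, 0] = y + 0.1 * rng.normal(size=300)
selected = aggregated_mi(X, y, aggregator="max", p=10)
```

Switching `aggregator` to `"min"` or `"mean"` changes only how the per-class scores are combined, which is what allows the three rankings to be compared on the same data.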

Classification Algorithms
In this work, two types of ML classifiers are used: (i) two ensemble-based classifiers, Random Forest (RF) and XGBoost (XGB), and (ii) four standalone classifiers, namely Gaussian Naïve Bayes (GNB), k-Nearest Neighbor (k-NN), Logistic Regression (LR) and Support Vector Machine (SVM). The optimal hyper-parameter values of these classifiers are estimated using cross-validation [38]. Several hyper-parameter optimization techniques exist, of which grid search, random search, Bayesian optimization and evolutionary-based optimization are commonly used. In this work, grid search was applied, and the results of the optimization process are shown in Table 4.
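Grid search with cross-validation can be sketched with scikit-learn's `GridSearchCV`. The parameter grid, the 5-fold setting and the synthetic data below are hypothetical and do not reproduce the tuned values reported in Table 4.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed, feature-reduced training data.
X, y = make_classification(n_samples=200, n_features=12,
                           n_informative=5, random_state=0)

# Hypothetical search space; every combination is evaluated with 5-fold CV.
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_  # combination with the best mean CV accuracy
best_score = search.best_score_
```

The same pattern applies to the other classifiers by swapping the estimator and its grid.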

Model Evaluation Metrics
The most commonly used evaluation metrics were used to evaluate the performance of the ML classifiers, which are: Accuracy (Acc.), Precision (P), Recall (R) and F1 score. In addition to these metrics, the training time, prediction time and execution time of each classifier were computed. The full description of these metrics and how they are computed is presented in Table 5.

Table 5. Evaluation metrics.
Measure / Metric | Formula | Explanation
Accuracy (Acc.) | (TP + TN) / (TP + TN + FP + FN) | TP: instances correctly classified as the right type of attack. TN: instances correctly classified as benign. FN: attack instances wrongly classified as benign. FP: benign instances wrongly classified as an attack.
F1 score | | F1 score is the harmonic mean of precision and recall.
Execution time (t_e) | t_e = t_1 + t_p | t_1: training time; t_p: prediction time.
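The metrics can be computed from the confusion counts as in the following minimal sketch. The accuracy and execution-time formulas follow the definitions given above; the precision and recall expressions are the standard textbook definitions and are stated here as an assumption. The function names and the example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # standard definition
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # standard definition
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def execution_time(training_time, prediction_time):
    """t_e = t_1 + t_p."""
    return training_time + prediction_time

# Hypothetical counts, for illustration only.
scores = classification_metrics(tp=90, tn=80, fp=20, fn=10)
total_time = execution_time(2.0, 0.5)
```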

Preliminary Exploration Setup: Binary Dataset
To conduct the experiment, the script was written in Python 3.7 using the Google Colab environment on the 64-bit Windows 10 operating system. The N-BaIoT dataset was organized such that the "Bashlite" and "Mirai" classes were grouped together to form one class, "attacked". As shown in Figure 2, the number of instances classified as "attacked" is much larger than the number of "benign" instances. Therefore, an under-sampling algorithm was applied to the "attacked" class to obtain a more balanced dataset. The resulting dataset was then split into a training set and a testing set using the train_test_split function found in the sklearn package, where 80% of the data was used for training and the remaining 20% for testing. Table 6 presents the statistical outline of the balanced binary dataset used.
The aim here is to investigate how the feature selection techniques perform on the binary dataset. Firstly, the ML models are applied without any FS technique. Then, different FS techniques are used. Table 7 summarizes the performance of the ML classifiers in terms of accuracy.
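The binary setup can be sketched as follows. The under-sampling algorithm is not specified in the text, so random under-sampling of the "attacked" class down to the size of the "benign" class is assumed here. The stratified split, the random seeds and the synthetic data are likewise illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))        # synthetic stand-in for traffic features
y = np.array([1] * 900 + [0] * 100)   # 1 = "attacked", 0 = "benign"

# Random under-sampling (assumed): keep as many attacked rows as benign rows.
attacked = np.flatnonzero(y == 1)
benign = np.flatnonzero(y == 0)
kept = rng.choice(attacked, size=len(benign), replace=False)
idx = np.concatenate([benign, kept])
X_bal, y_bal = X[idx], y[idx]

# 80% training, 20% testing; stratification is an added assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.2, stratify=y_bal, random_state=0)
```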

Discussion
Based on the results presented in Table 7, the following findings can be summarized:
• The k-NN and XGB classifiers yield the highest accuracy scores, which confirms the results reported in [20,21]. k-NN outperforms all other classifiers when all features are used.
• The performance of the classifiers degrades when the PCA technique is used. The only exception is SVM with 21 PCA components, as shown in Figure 3 and Table A2.
• Most ML models benefit more when the MI feature selection technique is applied. The accuracy of the ML classifiers exceeds the baseline, except for LR, whose accuracy decreases.
As a result, the following section presents how MI can be beneficial for detecting attack types when the multi-class dataset is used, and highlights the proposed aggregated MI feature selection approach.

N-BaIoT Dataset as a Multi-Class Dataset
To conduct the experiment fairly, the OvR strategy was applied. The reason behind this selection is its computational efficiency and interpretability. The OvR strategy represents each class by only one classifier, which allows knowledge to be gained about the class by inspecting its corresponding classifier.
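The OvR strategy can be illustrated with scikit-learn's `OneVsRestClassifier`. This is a minimal sketch on synthetic data; the base classifier, the number of classes and the data are hypothetical and chosen only to show that one binary classifier is fitted per class and can be inspected on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic three-class problem standing in for the multi-class dataset.
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One fitted binary classifier per class; its coefficients describe how
# that single class is separated from all the others.
per_class_models = ovr.estimators_
first_class_weights = per_class_models[0].coef_
predictions = ovr.predict(X)
```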
To obtain the MI scores of the features in the multi-class dataset, as mentioned earlier, the MI score of each feature is computed with respect to each class type, c ∈ C. That is, the target class is fixed using the multi-class classification strategy (OvR) and the MI of the feature is computed with respect to this class type. As a result, each feature obtains 10 different MI scores. The features are then ranked based on the aggregator functions listed in Table 3. Figures 4-6 show the mutual information scores of all features with respect to the MAX, MIN and AVERAGE aggregation functions.
As shown in Figures 4-6, each ranking method ranks the attributes differently. The main issue with such methods, as with all filter-based FS methods, is that the number of attributes to retain is a subjective choice. In this work, only the top 10% of features with the highest MI scores were used. Table 8 shows the names of the top 10% of features with respect to the aggregation functions.
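A minimal sketch of the top-10% selection is given below. The rounding rule (rounding up, with at least one feature kept) and the tie-breaking behaviour are assumptions, since the text does not specify them. The function name `top_percent` and the feature names are hypothetical.

```python
import math

def top_percent(scores, percent=10):
    """Return the names of the highest-scoring features.

    `scores` maps feature name -> aggregated MI score. The number of
    features kept is rounded up (an assumption), with a minimum of one.
    Ties keep the original insertion order because sorted() is stable.
    """
    k = max(1, math.ceil(len(scores) * percent / 100))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

# Toy scores for 20 hypothetical features: feat_19 has the highest score.
scores = {f"feat_{i}": i / 20 for i in range(20)}
selected = top_percent(scores)
print(selected)
```

Because the threshold is a free parameter, `percent` is exposed as an argument so that other cut-offs can be compared.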

Comparison of MI Feature Selection using Different Aggregation Functions
Based on these selected features, the performance of the ML classifiers was measured for each class type in terms of accuracy, precision, recall and F1 score. Table 9 presents the accuracy of the ML classifiers when features were selected based on different aggregation functions. Tables 10-15 present the precision, recall and F1 score of these classifiers.
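For reference, the per-class metrics named above can be computed from predicted and true labels as in the following sketch. It is an illustration on invented labels, not the evaluation code behind Tables 9-15. The function name `per_class_report` is hypothetical, and an undefined precision or recall (no predictions or no samples for a class) is reported as 0.0 by assumption.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Return overall accuracy and {class: (precision, recall, f1)}."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = (precision, recall, f1)
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, report

# Toy labels (invented): one "tcp" sample is misclassified as "udp".
y_true = ["benign", "benign", "mirai", "mirai", "tcp", "udp"]
y_pred = ["benign", "benign", "mirai", "mirai", "udp", "udp"]
accuracy, report = per_class_report(y_true, y_pred)
print(accuracy)
for name, (precision, recall, f1) in report.items():
    print(name, precision, recall, f1)
```

Reporting the metrics per class, rather than only overall accuracy, exposes classes that are detected poorly even when the overall accuracy is high.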

Discussion
This section analyzes the results listed in Tables 9-15. It also measures the performance of the employed classifiers in terms of time consumption. As shown in Table 9, the classifiers benefited differently when different aggregation operators were applied. The findings are summarized as follows:

•
When the "MIN" and "AVERAGE" functions were used, most classifiers performed well, and XGB, k-NN, GNB, LR and SVM achieved notably better results than when the "MAX" operator was used. Among these methods, XGB obtained the best accuracy (99.19%).

•
In most of the experiments, the classifiers showed good results when the "AVERAGE" operator was used as the aggregation function, except RF and SVM.

•
It is notable that RF benefited only when the "MAX" operator was used as the aggregation function. The performance of RF was otherwise degraded a little.

•
In terms of accuracy, the XGB and k-NN classifiers achieved 99.19% and 98.28% respectively, which means that they are quite close. However, when their performance was measured in terms of time consumption, k-NN is preferable, since it consumes less time, as shown in Table 16.

•
The prediction time is also a very important factor when employing an ML classifier in real-time applications. Thus, when the ML classifier is used to prevent attacks on IoT devices in real-time, sensitive intrusion detection systems, XGB is preferable.

Tables 11 and 12 show the performance of the classifiers according to class types. The findings are summarized as follows:

•
Among all attack types, the XGB and k-NN classifiers were capable of detecting the "Mirai" attack type perfectly.

•
Among the "Bashlite" attack types, XGB detected the "TCP" and "UDP" attack types poorly, whilst the k-NN classifier performed poorly with the "TCP" and "UDP" attack types, and also with the "COMBO" and "Junk" attack types.

•
Interestingly, RF recorded the best performance, with an F1 score of 100% for the "COMBO" attack type when the "AVERAGE" aggregation function was used. In addition, it achieved an F1 score of 99.95% with the "Junk" type.
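The distinction between training time and prediction time discussed above can be measured separately, as in the sketch below. It is illustrative only and does not reproduce the timings in Table 16. `NearestNeighbour` is a hypothetical toy 1-NN classifier standing in for any model with `fit` and `predict` methods, and the total time is assumed here to be the sum of the two phases.

```python
import time

class NearestNeighbour:
    """Toy 1-NN classifier (hypothetical stand-in for any fit/predict model)."""

    def fit(self, X, y):
        # Training only stores the data, so it is cheap.
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        def closest(query):
            # Squared Euclidean distance to every stored training point.
            dists = [sum((a - b) ** 2 for a, b in zip(query, x))
                     for x in self.X]
            return self.y[dists.index(min(dists))]
        return [closest(query) for query in X]

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Toy data (invented for illustration only).
X_train = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
y_train = ["benign", "benign", "mirai", "mirai"]
X_test = [(0.05, 0.1), (4.8, 5.1)]

model = NearestNeighbour()
_, train_time = timed(model.fit, X_train, y_train)
predictions, predict_time = timed(model.predict, X_test)
total_time = train_time + predict_time
print(f"train {train_time:.6f}s, predict {predict_time:.6f}s, "
      f"total {total_time:.6f}s")
```

Measuring the two phases separately matters because a classifier with a low total time may still be slow at prediction, which is the phase that dominates in a deployed real-time detector.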

Conclusions
This paper has proposed an aggregated mutual information-based feature selection approach with machine learning methods for enhancing IoT botnet attack detection. The main phases of this method include data collection, data preparation, feature selection and classification using the N-BaIoT benchmark dataset. Each attack type was fed into the feature selection methods to obtain a set of reduced features. The reduced feature set was then used for training the ML classifiers using the OvR strategy. Finally, the ML model was evaluated and the overall performance was reported. The proposed method was applied to the binary (attack and benign) and multi-class (10 different attacks and benign) classification problems. The effect of the PCA, MI and ANOVA f-test feature selection methods on the performance of the ML models was investigated. Two ensemble-based classifiers (RF and XGB) and four individual classifiers (GNB, k-NN, LR and SVM), with hyper-parameter tuning, were used in the experiments. The ML classifiers were evaluated by computing the accuracy, precision, recall and F1 score. In addition to these metrics, the training time, prediction time and execution time of each classifier were computed. The experimental results showed that the MI filter-based technique yielded the highest accuracy score when the binary dataset was used. For the multi-class dataset, an aggregated MI with different rank aggregation functions was proposed and tested. The findings showed that, in terms of accuracy, the XGB and k-NN classifiers achieved 99.19% and 98.28% respectively, while k-NN performed better in terms of time consumption. Future work can apply the proposed method to different IoT botnet datasets. In addition, deep learning-based methods can be proposed and investigated to enhance IoT botnet attack detection.